Decision Trees
Overview
Decision trees are one of the most intuitive models in machine learning. They partition the feature space into axis-aligned rectangular regions by recursively splitting on individual features, then assign a class (classification) or value (regression) to each region. The learning algorithm is beautifully simple: at each node, exhaustively search every feature and every threshold, pick the split that best purifies the resulting child nodes, then repeat recursively until a stopping criterion is met.
Key Concepts
We focus on classification trees below; regression trees follow the same structure but replace impurity with variance reduction.
Gini Impurity
Gini impurity measures how mixed a node is. For a node containing samples from \(K\) classes:
\[\text{Gini} = 1 - \sum_{i=1}^{K} p_i^2\]
where \(p_i\) is the proportion of samples belonging to class \(i\).
- A perfectly pure node (all one class) has \(\text{Gini} = 0\).
- Maximum disorder (a uniform distribution across classes) gives the highest Gini. For binary classification with a 50/50 split, \(\text{Gini} = 0.5\).
Gini can be interpreted as the probability that two samples drawn independently (with replacement) from the node belong to different classes.
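The formula translates directly into a few lines of Python. This is a minimal sketch; the function name `gini` and the use of label lists are illustrative choices, not tied to any library:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions p_i."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# A pure node has Gini 0; a 50/50 binary node has Gini 0.5.
print(gini(["a", "a", "a", "a"]))  # 0.0
print(gini(["a", "a", "b", "b"]))  # 0.5
```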
Entropy (Alternative Criterion)
Entropy is another common impurity measure, used in the ID3 and C4.5 algorithms:
\[H = -\sum_{i=1}^{K} p_i \log_2 p_i\]
A pure node has \(H = 0\); a 50/50 binary node has \(H = 1\) bit. In practice, Gini and entropy produce very similar trees — Gini is slightly faster to compute (no logarithm), which is why scikit-learn uses it as the default.
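Entropy can be sketched the same way (the function name is illustrative); note that classes with zero count simply never appear in the sum, which sidesteps the undefined \(0 \log 0\) term:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i)) over observed classes."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["a", "a", "a", "a"]))  # 0.0 (pure node)
print(entropy(["a", "a", "b", "b"]))  # 1.0 (50/50 binary node: 1 bit)
```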
Information Gain
Information gain (IG) is the reduction in impurity after a split. For a parent node split into children \(\{C_1, C_2, \dots\}\):
\[\text{IG} = \text{Impurity}(\text{parent}) - \sum_{i} \frac{n_i}{n}\,\text{Impurity}(C_i)\]
where \(n\) is the sample count at the parent and \(n_i\) is the count in child \(C_i\). The weighting by \(\frac{n_i}{n}\) ensures that larger children contribute more to the weighted impurity. The split that maximizes IG is chosen.
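The weighted-impurity formula above can be checked with a short sketch. The `gini` helper is repeated here so the block is self-contained; a perfect split recovers the full parent impurity as gain:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """IG = Impurity(parent) - sum over children of (n_i / n) * Impurity(C_i)."""
    n = len(parent)
    weighted_child_impurity = sum(len(c) / n * gini(c) for c in children)
    return gini(parent) - weighted_child_impurity

parent = ["a", "a", "b", "b"]
children = [["a", "a"], ["b", "b"]]        # a perfect split: both children pure
print(information_gain(parent, children))  # 0.5
```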
The Split Search Loop (CART Algorithm)
The CART (Classification and Regression Trees) algorithm performs a greedy, exhaustive search at each node:
- For each feature \(j \in \{1, \dots, d\}\):
- Sort the \(n\) samples by feature \(j\)’s values.
- Candidate thresholds are the midpoints between consecutive distinct feature values — so the thresholds come directly from the data, not from an arbitrary grid.
- For each candidate threshold, compute the IG of splitting there.
- Select the feature \(j^*\) and threshold \(t^*\) that produce the highest IG.
- Partition the samples: left child gets \(\{\mathbf{x} \mid x_{j^*} < t^*\}\), right child gets the rest.
- Recurse on each child.
The sorting step dominates: the cost is \(O(n \log n)\) per feature per node. Over the full tree, the total training cost is \(O(d \cdot n^2 \log n)\) in the worst case (every node sees all \(n\) samples), though in practice balanced splits reduce this significantly.
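The search loop above can be sketched as follows. This is a simplified reference implementation, not an optimized one: it recomputes child impurities from scratch at each threshold (an incremental update over the sorted order would bring the per-feature cost down to the \(O(n \log n)\) of the sort):

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Greedy exhaustive search over every feature and candidate threshold.

    Returns (feature index, threshold, information gain) of the best split.
    """
    n, d = len(X), len(X[0])
    parent_impurity = gini(y)
    best = (None, None, 0.0)
    for j in range(d):
        # Sort sample indices by feature j's values.
        order = sorted(range(n), key=lambda i: X[i][j])
        values = [X[i][j] for i in order]
        for k in range(1, n):
            if values[k] == values[k - 1]:
                continue  # thresholds only lie between distinct values
            t = (values[k] + values[k - 1]) / 2.0  # midpoint threshold
            left = [y[i] for i in order[:k]]
            right = [y[i] for i in order[k:]]
            ig = (parent_impurity
                  - (len(left) / n) * gini(left)
                  - (len(right) / n) * gini(right))
            if ig > best[2]:
                best = (j, t, ig)
    return best

X = [[1.0], [2.0], [10.0], [11.0]]
y = ["a", "a", "b", "b"]
print(best_split(X, y))  # (0, 6.0, 0.5)
```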
Stopping Criteria
Without constraints, a tree will keep splitting until every leaf is pure — memorizing the training data perfectly. This is a classic case of overfitting. Stopping criteria prevent this:
| Criterion | Effect |
|---|---|
| Max depth | Limits the number of levels in the tree |
| Min samples per leaf | Prevents leaves with too few samples |
| Min samples to split | A node must have at least this many samples to be considered for splitting |
| Min impurity decrease | A split must reduce impurity by at least this amount |
| Pure node | Recursion naturally stops when \(\text{Gini} = 0\) |
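A stopping check combining several of these criteria might look like the sketch below. The parameter names are illustrative (chosen to echo the table), not tied to any particular library:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def should_stop(labels, depth, max_depth=5, min_samples_split=2,
                impurity_tolerance=1e-7):
    """Return True if the recursion should stop at this node."""
    if depth >= max_depth:              # max depth reached
        return True
    if len(labels) < min_samples_split:  # too few samples to split
        return True
    if gini(labels) <= impurity_tolerance:  # node is (effectively) pure
        return True
    return False

print(should_stop(["a", "a", "a"], depth=2))  # True: pure node
print(should_stop(["a", "b", "a"], depth=2))  # False: keep splitting
```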
Prediction
To classify a new sample \(\mathbf{x}\), start at the root and follow the split rules at each internal node — go left if \(x_j < t\), right otherwise — until reaching a leaf. The leaf’s majority class is the prediction. For regression trees, the leaf returns the mean of its training targets.
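Prediction is a simple loop down the tree. The nested-dict node representation below is an assumed one for illustration: internal nodes carry `feature`, `threshold`, `left`, and `right`; leaves carry `prediction`:

```python
def predict(node, x):
    """Route sample x from the root to a leaf and return the leaf's class."""
    while "prediction" not in node:
        if x[node["feature"]] < node["threshold"]:
            node = node["left"]   # x_j < t: go left
        else:
            node = node["right"]  # x_j >= t: go right
    return node["prediction"]

# A depth-1 tree splitting feature 0 at threshold 6.0.
tree = {
    "feature": 0,
    "threshold": 6.0,
    "left": {"prediction": "a"},
    "right": {"prediction": "b"},
}
print(predict(tree, [2.0]))   # a
print(predict(tree, [10.0]))  # b
```

The loop runs once per level, which is where the \(O(\text{depth})\) prediction cost comes from.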
Advantages and Limitations
Advantages:
- Interpretable — the tree can be visualized and explained to non-experts
- No feature scaling required — splits are based on thresholds, not distances
- Handles both numerical and categorical features
- Fast prediction: \(O(\text{depth})\) per sample
Limitations:
- High variance — small changes in data can produce very different trees
- Greedy splitting — each split is locally optimal, not globally
- Axis-aligned boundaries only — diagonal or curved boundaries require many splits to approximate
- Prone to overfitting without regularization
These limitations motivate ensemble methods: bagging (Random Forests) averages many decorrelated trees to reduce variance, while boosting (Gradient Boosted Trees) sequentially corrects errors to reduce bias.
Regression Trees
For regression, the tree predicts a continuous value instead of a class. The impurity measure changes from Gini to variance (or equivalently, mean squared error):
\[\text{Impurity}(\text{node}) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2\]
where \(\bar{y}\) is the mean target value in the node. The split that maximizes the reduction in variance is chosen, and each leaf predicts \(\bar{y}\) for its samples. Everything else — the greedy search, stopping criteria, pruning — works identically.
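For regression, the impurity and gain computations swap in variance, mirroring the Gini-based versions above (function names are illustrative):

```python
def variance(ys):
    """Mean squared deviation from the node mean: the regression impurity."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def variance_reduction(parent, left, right):
    """Weighted impurity decrease, the regression analogue of information gain."""
    n = len(parent)
    return (variance(parent)
            - (len(left) / n) * variance(left)
            - (len(right) / n) * variance(right))

parent = [1.0, 1.0, 5.0, 5.0]
print(variance(parent))                                    # 4.0
print(variance_reduction(parent, [1.0, 1.0], [5.0, 5.0]))  # 4.0
```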